Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent?
Author
Abstract
Stochastic gradient descent (SGD) is a simple and very popular iterative method for solving stochastic optimization problems which arise in machine learning. A common practice is to return the average of the SGD iterates. While the utility of this is well understood for general convex problems, the situation is much less clear for strongly convex problems (such as solving SVM). Although the standard analysis in the strongly convex case requires averaging, it was recently shown that averaging actually degrades the convergence rate, and that a better rate is obtainable by averaging just a suffix of the iterates. The question we pose is whether averaging is needed at all to obtain optimal rates.

We consider the problem of stochastically optimizing a convex function F over a convex domain W using stochastic gradient descent (SGD). The algorithm makes use of an oracle which, given some w ∈ W, returns a random vector ĝ whose expectation is a subgradient of F at w. For example, consider the linear SVM optimization problem over a training set {(x_i, y_i)}_{i=1}^m:

    min_w  (λ/2)‖w‖² + (1/m) ∑_{i=1}^m max{0, 1 − y_i⟨x_i, w⟩}.

Given some w, we can easily compute an unbiased estimate of a subgradient by picking a single example (x_i, y_i) uniformly at random, and returning a subgradient of (λ/2)‖w‖² + max{0, 1 − y_i⟨x_i, w⟩}.

SGD is parameterized by step sizes η_1, …, η_T, and is defined as follows: we initialize w_1 ∈ W arbitrarily; at each round t, we obtain an unbiased estimate ĝ_t of a subgradient of F at w_t, and let

    w_{t+1} = Π_W(w_t − η_t ĝ_t),

where Π_W is the projection operator onto W. This algorithm produces a sequence of iterates w_1, …, w_T. For general convex problems, it is well known that if we pick η_t = Θ(1/√T) and return the average of the iterates, w̄_T = (w_1 + … + w_T)/T, then under mild conditions

    F(w̄_T) − inf_{w ∈ W} F(w) ≤ O(1/√T),

both in expectation and with high probability 1 − δ (with logarithmic dependence on δ).

We focus here on cases where F is strongly convex: roughly speaking, it can be lower bounded at any point in W by a quadratic function (as in the case of SVM optimization). When F is strongly convex, one can obtain faster convergence rates. For example, by picking η_t = 1/t, the expected suboptimality of w̄_T is O(log(T)/T) (Hazan et al., 2007; Shalev-Shwartz et al., 2011). Recently, Rakhlin et al. (2012) showed that one can in fact get rid of the log(T) factor and obtain an optimal O(1/T) rate with step sizes η_t = Θ(1/t), by replacing the simple average w̄_T with a suffix average, which averages only the last fraction of the iterates.
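The following is a minimal sketch of the procedure the abstract describes: a stochastic subgradient oracle obtained by sampling a single SVM training example, the projected update w_{t+1} = Π_W(w_t − η_t ĝ_t) with a Θ(1/t) step size for the strongly convex case, and the three candidate outputs under discussion (the last iterate, the full average w̄_T, and a suffix average). It is not the paper's reference implementation: the choice of W as a Euclidean ball, the function names, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def svm_subgradient_oracle(w, X, y, lam, rng):
    """Unbiased estimate of a subgradient of
    F(w) = (lam/2)||w||^2 + (1/m) sum_i max{0, 1 - y_i<x_i, w>}:
    sample one example uniformly and differentiate its term."""
    i = rng.integers(len(y))
    g = lam * w
    if y[i] * X[i].dot(w) < 1.0:  # hinge term active: add subgradient -y_i * x_i
        g = g - y[i] * X[i]
    return g

def project_ball(w, radius):
    """Projection onto W = {w : ||w|| <= radius} (an assumed choice of domain)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def sgd(X, y, lam, T, radius, alpha=0.5, seed=0):
    """Projected SGD; returns (last iterate, full average, alpha-suffix average)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    avg = np.zeros_like(w)      # running average of the iterates produced so far
    suffix = np.zeros_like(w)   # running average of the last ceil(alpha*T) iterates
    t0 = T - int(np.ceil(alpha * T))
    k = 0
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)   # eta_t = Theta(1/t), the strongly convex step size
        g = svm_subgradient_oracle(w, X, y, lam, rng)
        w = project_ball(w - eta * g, radius)
        avg += (w - avg) / t
        if t > t0:
            k += 1
            suffix += (w - suffix) / k
    return w, avg, suffix

def svm_objective(w, X, y, lam):
    """Full (non-stochastic) SVM objective F(w), for evaluation only."""
    return 0.5 * lam * w @ w + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 5))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=500))
    lam = 0.1
    last, full_avg, suffix_avg = sgd(X, y, lam, T=20_000,
                                     radius=1.0 / np.sqrt(lam))
    for name, w in [("last iterate", last), ("full average", full_avg),
                    ("suffix average", suffix_avg)]:
        print(f"{name:15s} F(w) = {svm_objective(w, X, y, lam):.5f}")
```

On a toy instance like this one, the three outputs can be compared directly; the open problem asks whether the last iterate alone already attains the optimal rate that suffix averaging achieves, making any averaging unnecessary.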
Similar Resources
On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging
This paper considers stochastic subgradient mirror-descent method for solving constrained convex minimization problems. In particular, a stochastic subgradient mirror-descent method with weighted iterate-averaging is investigated and its per-iterate convergence rate is analyzed. The novel part of the approach is in the choice of weights that are used to construct the averages. Through the use o...
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
In this paper, we consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research communit...
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This mig...
Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes
Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required nontrivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate t...
Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections
We consider stochastic strongly convex optimization with a complex inequality constraint. This complex inequality constraint may lead to computationally expensive projections in algorithmic iterations of the stochastic gradient descent (SGD) methods. To reduce the computation costs pertaining to the projections, we propose an Epoch-Projection Stochastic Gradient Descent (Epro-SGD) method. The p...